Efficient Filtering on Hidden Document Streams
Authors
Abstract
Many online services, such as Twitter and GNIP, offer streaming programming interfaces that allow real-time information filtering based on keywords or other conditions. However, all of these services impose strict access constraints or charge based on usage. We refer to such streams as “hidden streams” to draw a parallel to the well-studied hidden Web, which similarly restricts access to the contents of a database through a querying interface. At the same time, users’ interests are often captured by complex classification models that, implicitly or explicitly, specify hundreds of keyword-based rules, along with the rules’ accuracies. In this paper, we study how to best utilize a constrained streaming access interface to maximize the number of retrieved items that are relevant with respect to a classifier expressed as a set of rules. We consider two problem variants. The static version assumes that the popularity of the keywords is known and constant over time. The dynamic version lifts this assumption and can be viewed as an exploration-vs.-exploitation problem. We show that both problems are NP-hard, and propose exact and bounded approximation algorithms for several settings, including different types of access constraints. We experimentally evaluate our algorithms on real Twitter data.
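To make the static variant concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes each rule is a single keyword with a known, constant volume and a known precision, that rules do not overlap, and that the interface limits both the number of tracked keywords and the total volume. The names `Rule`, `greedy_select`, `max_rules`, and `max_volume` are hypothetical, and the greedy heuristic shown carries no approximation guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    keyword: str      # keyword submitted to the streaming interface
    volume: float     # assumed known items per time unit matching the keyword
    precision: float  # assumed fraction of matching items that are relevant


def greedy_select(rules, max_rules, max_volume):
    """Heuristic: take rules in descending precision while both limits hold.

    Illustrative only; overlap between rules is ignored.
    """
    chosen, used = [], 0.0
    for rule in sorted(rules, key=lambda r: r.precision, reverse=True):
        if len(chosen) >= max_rules:
            break
        if used + rule.volume <= max_volume:
            chosen.append(rule)
            used += rule.volume
    return chosen


def expected_relevant(chosen):
    """Expected relevant items per time unit, assuming disjoint rules."""
    return sum(r.volume * r.precision for r in chosen)
```

In the dynamic variant the volumes would not be known in advance, so a selection step like this would have to be interleaved with estimating them from the stream.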
Similar resources
Dimensionality Reduction and Filtering on Time Series Sensor Streams
This chapter surveys fundamental tools for dimensionality reduction and filtering of time series streams, illustrating what it takes to apply them efficiently and effectively to numerous problems. In particular, we show how least-squares based techniques (auto-regression and principal component analysis) can be successfully used to discover correlations both across streams, as well as across ti...
Online Filtering and Uncertainty Management Techniques for RFID Data Processing
RFID is one of the emerging technologies for a wide-range of applications, including supply chain and asset management, healthcare and intruder localization. However, the nature of an RFID data stream is noisy, redundant and unreliable, making it unsuitable for direct use in applications. In this paper, we propose specific RFID Online Filtering and Uncertainty Management techniques that operate...
Using temporal IDF for efficient novelty detection in text streams
Novelty detection in text streams is a challenging task that emerges in quite a few different scenarios, ranging from email thread filtering to RSS news feed recommendation on a smartphone. An efficient novelty detection algorithm can save the user a great deal of time and resources when browsing through relevant yet usually previously-seen content. Most of the recent research on detection of n...
Dimensionality Reduction and Forecasting on Streams
We consider the problem of capturing correlations and finding hidden variables corresponding to trends on collections of time series streams. Our proposed method, SPIRIT, can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreo...
Online Dictionary Matching for Streams of XML Documents
We consider the online multiple-pattern matching problem for streams of XML documents, when the patterns are expressed as linear XPath expressions containing child operators (/), descendant operators (//) and wildcards (∗) but no predicates. For each document in the stream, the task is to determine all occurrences in the document of all the patterns. We present a general multiple-pattern-matchi...
Publication date: 2014